
Hot Topics | In Machine Translation, Gold-Standard Translations Are Not Always Gold

Wang Xi | International Translation News
2024-09-09

In Machine Translation, Gold-Standard Translations Are Not Always Gold 

(Image from the Slator website)

In a February 2, 2024 paper, machine translation researchers from Johns Hopkins University and Microsoft emphasized that gold-standard translations are not always “gold”.

They introduced a new fine-tuning approach called contrastive preference optimization (CPO) that aims to help models avoid generating near-perfect yet flawed translations by using carefully curated preference data.

CPO is a “more efficient variant” of direct preference optimization (DPO) that integrates preference learning into the training process. The researchers suggested that implementing CPO could significantly enhance the performance of moderate-sized large language models (LLMs) in machine translation (MT), matching or even surpassing the capabilities of GPT-4.

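As a rough illustration of how a preference-learning objective of this kind works, the sketch below computes a contrastive preference loss from the model's log-probabilities of a preferred and a dis-preferred translation, plus a negative log-likelihood anchor on the preferred one. The function name, the `beta` scaling, and the exact combination of terms are illustrative assumptions, not the paper's released implementation.

```python
import math

def cpo_style_loss(logp_preferred: float, logp_dispreferred: float,
                   beta: float = 0.1) -> float:
    """Sketch of a contrastive preference loss (hypothetical helper).

    The preference term rewards ranking the preferred translation
    above the dis-preferred one; the NLL term keeps the model close
    to the preferred output.
    """
    margin = beta * (logp_preferred - logp_dispreferred)
    # -log(sigmoid(margin)): small when the preferred translation is
    # already assigned much higher probability than the dis-preferred one.
    prefer_term = math.log(1.0 + math.exp(-margin))
    # Negative log-likelihood of the preferred translation.
    nll_term = -logp_preferred
    return prefer_term + nll_term
```

The loss is smallest when the model both prefers the better translation by a wide margin and assigns it high absolute probability, which is the intuition behind combining a contrastive term with an NLL anchor.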

They explained that CPO addresses two main issues with traditional supervised fine-tuning (SFT) methods, pushing “the performance boundary of models that have reached saturation through SFT training.”

Firstly, SFT focuses on making model outputs match reference translations, thus potentially limiting the model’s performance to the quality of the training data, which might not always be perfect. “Even human-written data, traditionally considered high-quality, is not immune to quality issues,” they said. Their analysis of the FLORES-200 dataset revealed instances where the quality of human-written parallel data was even inferior to that of system-generated translations. This finding led them to question the effectiveness of training models solely based on replicating reference translations.

Secondly, SFT lacks a mechanism to prevent the model from making its own mistakes. Sometimes, even though a translation may seem good, it might contain small errors like missing words, they explained. CPO helps address these problems by training the model to avoid producing near-perfect but ultimately flawed translations, leading to significant enhancements in translation performance, surpassing the capabilities of traditional SFT methods.


High-Quality Preference Dataset

CPO requires access to labeled preference data, yet such data is scarce in MT. To facilitate the implementation of CPO, the researchers built and released a high-quality preference dataset for ten language pairs: English ⇔ German, Czech, Icelandic, Chinese, and Russian.
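Counting the directions makes the "ten language pairs" concrete: five languages are each paired with English in both directions. A small illustrative snippet (the two-letter codes are standard ISO 639-1 abbreviations):

```python
# The five non-English languages covered by the dataset.
LANGUAGES = ["de", "cs", "is", "zh", "ru"]  # German, Czech, Icelandic, Chinese, Russian

# Each language is paired with English in both directions,
# giving the ten translation directions mentioned above.
pairs = [f"en-{lang}" for lang in LANGUAGES] + [f"{lang}-en" for lang in LANGUAGES]
```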

This dataset, derived from the FLORES-200 dataset, includes three translations per source sentence: the original target reference, a translation from GPT-4, and a translation from ALMA. The highest-scoring translation is labeled as preferred, while the lowest-scoring translation is labeled as dis-preferred. “This approach of using high-quality but not flawless translations as dis-preferred data aids in training the model to refine details and achieve perfection in generated translations,” they explained.

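The labeling rule described above can be sketched as follows: given the candidate translations for one source sentence, each with a quality score, the highest-scoring candidate becomes the preferred example and the lowest-scoring one the dis-preferred example. The function name, data layout, and scores are illustrative, not taken from the paper's released code.

```python
def build_preference_pair(candidates):
    """Label the best-scoring candidate as preferred and the
    worst-scoring one as dis-preferred (illustrative sketch).

    `candidates` maps a system name to (translation, quality_score).
    """
    ranked = sorted(candidates.items(), key=lambda item: item[1][1])
    (_, (dispreferred, _)), (_, (preferred, _)) = ranked[0], ranked[-1]
    return {"preferred": preferred, "dispreferred": dispreferred}

# Example with made-up quality scores for one source sentence:
example = build_preference_pair({
    "reference": ("Die Katze sitzt auf der Matte.", 85.0),
    "gpt-4":     ("Die Katze sitzt gerade auf der Matte.", 90.0),
    "alma":      ("Die Katze sitzt auf Matte.", 80.0),
})
```

Note that the middle-scoring translation is simply unused, matching the description: only the extremes of the ranking enter the preference pair.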

Significant Advancement

The researchers further fine-tuned ALMA-13B-LoRA (Advanced Language Model-based trAnslator), an LLM released in 2023 and "one of the top moderate-size language-model based translation systems," surpassing even larger models such as GPT-3.5 and conventional models such as NLLB-54B.

They compared the new fine-tuned model, named ALMA-13B-R, against other recently released 13B LLM-based models, as well as top-performing translation systems like GPT-4 and TowerInstruct.

The results demonstrated that ALMA-13B-R matched or even outperformed these advanced translation models, showing that applying the CPO method to fine-tune ALMA-13B-LoRA significantly enhances the model's capabilities, bringing its performance to a level that equals or even surpasses that of GPT-4. For the evaluation, they used wmt23-cometkiwi-da-xxl, XCOMET-XXL, and wmt22-cometkiwi-da.

Finally, the researchers noted that CPO not only improves translation quality but also offers advantages in memory efficiency and speed, concluding that this marks "a significant advancement in the field of MT."

Original article:

https://slator.com/in-machine-translation-gold-standard-translations-are-not-always-gold/


Note: This article is adapted from the Slator website for learning and exchange purposes only. In case of copyright infringement, please contact the editors to request removal.



- END -



Translation and abstracts: Wang Xi

Post editor: Yuan Yuzhao

Faculty advisor: Liang Chen

Project coordinators: Li Mengyi, Wang Yuqing



▶ International Translation News

| Translation Companies | An Introduction to TransPerfect

| Translation Companies | RWS: No. 2 Worldwide

| Translation Companies | Keywords Studios, the "Acquisition Maniac"

| Consulting Firms | An Introduction to Nimdzi Insights

| Consulting Firms | An Introduction to Slator

| Consulting Firms | An Introduction to CSA Research

| Industry Organizations | The International Federation of Translators (FIT)

| Industry Organizations | The American Translators Association (ATA)

| Industry Organizations | Canada's Translation Bureau

| Translation Schools | Middlebury Institute of International Studies at Monterey (MIIS)

| Translation Schools | University of Glasgow

| Translation Schools | University of Essex

| Hot Topics | The Ethics of ChatGPT (Part 1)

| Hot Topics | The Ethics of ChatGPT (Part 2)

